Abstract

Code to implement network protocols can be either
inside the kernel of an operating system or in
user-level processes. Kernel-resident code is hard to
develop, debug, and maintain, but user-level
implementations typically incur significant overhead
and perform poorly.

The performance of user-level network code
depends on the mechanism used to demultiplex
received packets. Demultiplexing in a user-level
process increases the rate of context switches and
system calls, resulting in poor performance.
Demultiplexing in the kernel eliminates unnecessary
overhead.

This paper describes the packet filter, a
kernel-resident, protocol-independent packet
demultiplexer. Individual user processes have great
flexibility in selecting which packets they will
receive. Protocol implementations using the packet
filter perform quite well, and have been in
production use for several years.


1. Introduction 

It is not always appropriate to implement networking
protocols inside the kernel of an operating system.
Although kernel-resident network code can often
outperform a user-level implementation, it is usually harder to
implement and maintain, and much less portable. If
optimal performance is not the primary goal of a
protocol implementation, one might well prefer to
implement it outside the kernel. Unfortunately, in most
operating systems user-level network code is doomed to
terrible performance.

In this paper we show that it is possible to get
adequate performance from a user-level protocol
implementation, while retaining all the features of
user-level programming that make it far more pleasant than
kernel programming.

The key to good performance is the mechanism used
to demultiplex received packets to the appropriate user
process. Demultiplexing can be done either in the
kernel, or in a user-level switching process. User-mode
demultiplexing allows flexible control over how
packets are distributed, but is expensive because it normally
involves at least two context switches and three system
calls per received packet. Kernel demultiplexing is
efficient, but in existing systems the criteria used to
distinguish between packets are too crude.

This paper describes the packet filter, a facility that
combines both performance and flexibility. The packet
filter is part of the operating system kernel, so it
delivers packets with a minimum of system calls and
context switches, yet it is able to distinguish between
packets according to arbitrary and dynamically variable
user-specified criteria. The result is a reasonably
efficient, easy-to-use abstraction for developing and
running network applications.

The facility we describe is not a paper design, but the
evolutionary result of much experience and tinkering.
The packet filter has been in use at several sites for
many years, for both development and production use
in a wide variety of applications, and has insulated
these applications from substantial changes in the
underlying operating system. It has been of clear practical
value.

In section 2, we discuss in greater detail the
motivation for the packet filter. We describe the abstract
interface in section 3, and briefly sketch the implementation
in section 4. We then illustrate, in section 5, some uses
to which the packet filter has been put, and in section 6
discuss its performance.



2. Motivation 

Software to support networking protocols has
become tremendously important as a result of the use of LAN
technology and workstations. The sheer bulk of this
software is an indication that it may be overwhelming
our ability to create reliable, efficient code: for
example, 30% of the 4.3BSD Unix [8, 21] kernel source,
25% of the TOPS-20 [10] (Version 6.1) kernel source,
and 32% of the V-system [4] kernel source are devoted
to networking.

Development of network software is slow and
seldom yields finished systems; debugging may continue
long after the software is put into operation. Continual
debugging of production code results not only from
deficiencies in the original code, but also from
inevitable evolution of the protocols and changes in the
network environment.

In many operating systems, network code resides in 
the kernel. This makes it much harder to write and 
debug: 

• Each time a bug is found, the kernel must be 
recompiled and rebooted. 

• Bugs in kernel code are likely to cause system 
crashes. 

• Functionally independent kernel modules may 
have complex interactions over shared resources. 

• Kernel-code debugging cannot be done during 
normal timesharing; single-user time must be 
scheduled, resulting in inconvenience for 
timesharing users and odd work hours for system 
programmers. 

• Sophisticated debugging and monitoring facilities 
available for developing user-level programs may 
not be available for developing kernel code. 

• Kernel source code is not always available. 

In spite of these drawbacks, network code is still
usually put in the kernel because the drawbacks of
putting it outside the kernel seem worse. If a single
user-level process is used for demultiplexing packets, then
for each received packet the system will have to switch
into the demultiplexing process, notify that process of
the packet, then switch again as the demultiplexing
process transfers the packet to the appropriate
destination process. (Figure 2-1 depicts the costs associated
with this approach.) Context switching and
interprocess communication are usually expensive, so
clearly it would be more efficient to immediately
deliver each packet to the ultimate destination process.
(Figure 2-2 shows how this approach reduces costs.)
This requires that the kernel be able to determine to
which process each packet should go; the problem is
how to allow user-level processes to specify which
packets they want.

One simple mechanism is for the kernel to use a
specific packet field as a key; a user process registers
with the kernel the field value for packets it wants to
receive. Since the kernel does not know the structure of
higher-layer protocol headers, the discriminant field
must be in the lowest layer, such as an Ethernet [9]
“type” field. This is not always a good solution. For
example, in most environments the Ethernet type field
serves only to identify one of a small set of protocol
families; almost all packets must be further
discriminated by some protocol-specific field. If the
kernel can demultiplex only on the type field, then one
must still use a user-level switching process with its
attendant high cost.



Figure 2-1: Costs of demultiplexing in a user process 



Figure 2-2: Costs of demultiplexing in the kernel 

The packet filter is a more flexible kernel-resident
demultiplexer. A user process specifies an arbitrary
predicate to select the packets it wants; all
protocol-specific knowledge is in the program that receives the
packets. There is no need to modify the kernel to
support a new protocol. This mechanism evolved for use
with Ethernet data-link layers, but will work with most
similar datagram networks.

The packet filter not only isolates the kernel from the
details of specific protocols; it insulates protocol code
from the details of the kernel implementation. The
packet filter is not strongly tied to a particular system;
in its Unix implementation, it is cleanly separated from
other kernel facilities and the novel part of the
user-level interface is not specific to Unix. Because protocol
code lives outside the kernel it does not have to be
modified to be useful with a wide variety of kernel
implementations. In systems where context-switching
is inexpensive, the performance advantage of kernel
demultiplexing will be reduced, but the packet filter
may still be a good model for a user-level demultiplexer
to emulate.

In addition to the cost and inconvenience of
demultiplexing, the cost of domain crossing whenever control
crosses between kernel and user domains has
discouraged the implementation of protocol code in user
processes. In many protocols, far more packets are
exchanged at lower levels than are seen at higher levels
(these include control, acknowledgement, and duplicate
packets). A kernel-resident implementation confines
these overhead packets to the kernel and greatly reduces
domain crossing, as depicted in figure 2-3. The packet
filter mechanism cannot eliminate this problem; we can
reduce it through careful implementation and by
batching together domain-crossing events (see section 3).



Figure 2-3: Kernel-resident protocols 
reduce domain-crossing 


User-level access to the data-link layer is not
universally regarded as a good thing. Some have suggested
that user programs never need access to explicit
network communication [23]; others might argue that all
networking should be done within a transport protocol
such as IP [19] or the ISO Transport Protocol [15], with
demultiplexing done by the transport layer code. Both
these arguments implicitly assume a homogeneous
networking environment, but heterogeneity is often a fact
of life: machines from different manufacturers speak
various transport protocols, and research on new
protocol designs at the data-link level is still profitable.

The packet filter allows rapid development of
networking programs, by relatively inexperienced
programmers, without disrupting other users of a
timesharing system. It places few constraints on the
protocols that may be implemented, but in spite of this
flexibility it performs well enough for many uses.


2.1. Historical background 

As far as we are aware, the idea (and name) of the
packet filter first arose in 1976, in the Xerox Alto [3].
Because the Alto operating system shared a single
address space with all processes, and because security was
not important, the filters were simply procedures in the
user-level programs; these procedures were called by
the packet demultiplexing mechanism. The first Unix
implementation of the packet filter was done in 1980.

3. User-level interface abstraction 

Figure 3-1 shows how the packet filter is related to
other parts of a system. Packets received from the
network are passed through the packet filter and
distributed to user processes; code to implement protocols
lives in each process. Figure 3-2 shows, for contrast,
how networking is done in “vanilla” 4.3BSD Unix;
protocols are implemented inside the kernel and data
buffers are passed from protocol code to user processes.
Figure 3-3 shows how both models can coexist; some
programs may even use both means to access the
network.



Figure 3-1: Relationship between packet filter 
and other system components 

The programmer’s interface to the packet filter has
three major components: packet transmission, packet
reception, and control and status information. We
describe these in the context of the 4.3BSD Unix
implementation.

Packet transmission is simple; the user presents a
buffer containing a complete packet, including data-link
header, to the kernel using the normal Unix write
system call; control returns to the user once the packet is
queued for transmission. Transmission is unreliable if
the data link is unreliable; the user must discover
transmission failure through lack of response rather than an
explicit error.












Figure 3-2: 4.3BSD networking model 



Figure 3-3: Packet filter coexisting with 
4.3BSD networking model 

Packet reception is more complicated. The packet
filter manages some number of ports, each of which
may be opened by a Unix program as a “character
special device.” Associated with each port is a filter, a
user-specified predicate on received packets. If a filter
accepts a packet, the packet is queued for delivery to
the associated port. A filter is specified using a small
stack-based “language,” in which one can push
arbitrary constants or words from the received packet,
and apply binary operations to the top of the stack. The
filter language is discussed in more detail in section 3.1.
A process binds a filter to a port using an ioctl system
call; a new filter can be bound at any time, at a cost
comparable to that of receiving a packet; in practice,
filters are not replaced very often.

Two processes implementing different
communication streams under the same protocol must specify
slightly different predicates so that packets are
delivered appropriately. For example, a program
implementing a Pup [2] protocol would include a test on
the Pup destination socket number as part of its
predicate. The layering in a protocol architecture is not
necessarily reflected in a filter predicate, which may
well examine packet fields from several layers.


When a program performs a read system call on the
file descriptor corresponding to a packet filter port, the
first of any queued packets is returned. The entire
packet, including the data-link layer header, is returned,
so that user programs may implement protocols that
depend on header information. The program may ask
that all pending packets be returned in a batch; this is
useful for high-volume communications because it can
amortize the overhead of performing a system call over
several packets. Figure 3-4 depicts per-packet
overheads without batching; figure 3-5 shows how these are
reduced when batching is used.


Figure 3-4: Delivery without received-packet batching



Figure 3-5: Delivery with received-packet batching 


If no packets are queued, the read system call blocks
until a packet is available; if no packet arrives during a
timeout period, the read call terminates and reports an
error. Simple programs can be written using a “write;
read with timeout; retry if necessary” paradigm. More
elaborate programs may take advantage of two more
sophisticated synchronization mechanisms: the 4.3BSD
select system call, or an interrupt-like facility using Unix
“signals,” either of which allows non-blocking
network I/O.

3.1. Filter language details 

The heart of the packet filter is an interpreter for the 
“language” shown in figure 3-6. A filter is a data 
structure including an array of 16-bit words. Each word 
is normally interpreted as an instruction with two fields, 
a stack action field and a binary operation field. 

A stack action may cause either a word from the
received packet or a constant to be pushed on the stack.
A binary operation pops the top two words from the
stack, and pushes a result. Thus, filter programs
evaluate a logical expression composed of tests on the
values of various fields in the received packet. The
filter is normally evaluated until the program is
exhausted. If the value remaining on top of the stack is
non-zero, the filter is deemed to have accepted the
packet.

It is sometimes possible to avoid evaluating the entire
filter before deciding whether to accept a packet. This
is especially important for performance, since on a busy
system several dozen filters may be applied to an
incoming packet before it is accepted. The filter language
therefore includes four “short-circuit” binary logical
operations that, when evaluated, either push a result and
allow the program to continue, or terminate the
program and return an appropriate boolean.
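
The evaluation rules described above can be sketched as a small interpreter. This is an illustrative sketch under stated assumptions, not the kernel's code: the text does not give the numeric encoding, so the sketch assumes the stack action occupies the low 10 bits of each instruction word and the binary operator the high 6 bits, and the opcode values, stack depth, and the name filter_apply are hypothetical. AND, OR, and XOR are treated as bitwise operations, which is what the masking in the paper's own examples implies.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding (hypothetical values): action in the low 10 bits,
 * binary operator in the high 6 bits of each 16-bit instruction word. */
#define OP(n) ((n) << 10)
enum { NOPUSH, PUSHLIT, PUSHZERO, PUSHONE, PUSHFFFF, PUSHFF00, PUSH00FF,
       PUSHWORD = 16 };
enum { NOP = OP(0), EQ = OP(1), NEQ = OP(2), LT = OP(3), LE = OP(4),
       GT = OP(5), GE = OP(6), AND = OP(7), OR = OP(8), XOR = OP(9),
       COR = OP(10), CAND = OP(11), CNOR = OP(12), CNAND = OP(13) };

#define STACK_DEPTH 32

/* Returns 1 if the filter accepts the packet and 0 if it rejects it.
 * Any error (stack overflow or underflow, a reference past the end of
 * the packet, an unknown instruction) is treated as rejection. */
int filter_apply(const uint16_t *prog, size_t nprog,
                 const uint16_t *pkt, size_t npkt)
{
    uint16_t stack[STACK_DEPTH];
    size_t sp = 0;                       /* number of words on the stack */

    for (size_t pc = 0; pc < nprog; pc++) {
        int action = prog[pc] & 0x03FF;
        int op = prog[pc] & 0xFC00;
        uint16_t v = 0;
        int push = 1;

        switch (action) {
        case NOPUSH:   push = 0; break;
        case PUSHLIT:
            if (++pc >= nprog) return 0; /* literal word is missing */
            v = prog[pc];
            break;
        case PUSHZERO: v = 0;      break;
        case PUSHONE:  v = 1;      break;
        case PUSHFFFF: v = 0xFFFF; break;
        case PUSHFF00: v = 0xFF00; break;
        case PUSH00FF: v = 0x00FF; break;
        default:                         /* PUSHWORD+n */
            if (action < PUSHWORD || (size_t)(action - PUSHWORD) >= npkt)
                return 0;
            v = pkt[action - PUSHWORD];
            break;
        }
        if (push) {
            if (sp == STACK_DEPTH) return 0;
            stack[sp++] = v;
        }
        if (op == NOP) continue;
        if (sp < 2) return 0;

        uint16_t t1 = stack[--sp];       /* original top of stack */
        uint16_t t2 = stack[--sp];       /* word below it */
        uint16_t r;
        switch (op) {
        case EQ:    r = (t2 == t1); break;
        case NEQ:   r = (t2 != t1); break;
        case LT:    r = (t2 <  t1); break;
        case LE:    r = (t2 <= t1); break;
        case GT:    r = (t2 >  t1); break;
        case GE:    r = (t2 >= t1); break;
        case AND:   r = t2 & t1;    break;
        case OR:    r = t2 | t1;    break;
        case XOR:   r = t2 ^ t1;    break;
        /* Short-circuit forms compute r = (t1 == t2), then may return. */
        case COR:   r = (t1 == t2); if (r)  return 1; break;
        case CAND:  r = (t1 == t2); if (!r) return 0; break;
        case CNOR:  r = (t1 == t2); if (r)  return 0; break;
        case CNAND: r = (t1 == t2); if (!r) return 1; break;
        default:    return 0;
        }
        stack[sp++] = r;
    }
    /* Program exhausted: accept if the top of the stack is non-zero. */
    return sp > 0 && stack[sp - 1] != 0;
}
```

Since there are no branch instructions, the loop visits each instruction word at most once, so the cost of applying a filter is bounded by its length; the short-circuit operators only ever shorten it.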

Figure 3-8 shows an example of a simple filter
program; figure 3-9 shows an example of a filter
program using short-circuit operations. Both are used
with Pup [2] packets on a 3Mbit/sec Experimental
Ethernet [17]; the data-link header is 4 bytes (two
words) long, with the packet type in the second word
(see figure 3-7). In normal use, the filters are not
directly constructed by the programmer, but are
“compiled” at run time by a library procedure.

The design of the filter language is not the result of
careful analysis but rather embodies several accidents
of history, such as its bias towards 16-bit fields. It has
evolved over the years; in particular, the short-circuit
operations were added after an analysis showed that
they would reduce the cost of interpreting filter
predicates. One could imagine alternatives to the stack
language structure; for example, a predicate could be an
array of (field-offset, expected-value) pairs, and the
predicate would be satisfied if all the specified fields
had the specified values. However, the additional
flexibility of the stack language has often proved useful
in constructing efficient filters. Since the “instruction
set” is implemented in software, not hardware, there is
no execution-time penalty associated with supporting a
broad range of operations.
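
For comparison, the alternative of (field-offset, expected-value) pairs mentioned above might look like the following sketch. The structure and function names are hypothetical, and offsets are assumed to count 16-bit words from the start of the packet.

```c
#include <stddef.h>
#include <stdint.h>

/* One (field-offset, expected-value) pair; the offset counts 16-bit
 * words from the start of the packet. (Hypothetical layout.) */
struct field_test {
    size_t   offset;
    uint16_t expected;
};

/* Returns 1 only if every listed field holds its expected value;
 * a field that lies beyond the end of the packet counts as a mismatch. */
int pairs_match(const struct field_test *tests, size_t ntests,
                const uint16_t *pkt, size_t npkt)
{
    for (size_t i = 0; i < ntests; i++) {
        if (tests[i].offset >= npkt ||
            pkt[tests[i].offset] != tests[i].expected)
            return 0;
    }
    return 1;
}
```

Such a predicate can express an exact-match test on whole words, but not a range test or a test on a masked byte within a word, which is the kind of flexibility the stack language adds.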



Instruction format

    First word:    Binary Operator | Stack Action
                   (16 bits, split into a 10-bit field and a 6-bit field)
    Second word:   Literal constant (16 bits)

    (second word used only if Stack Action = PUSHLIT)


Stack Action    Effect on stack

NOPUSH          None
PUSHLIT         Following instruction word is pushed
PUSHZERO        Constant zero is pushed
PUSHONE         Constant one is pushed
PUSHFFFF        Constant 0xFFFF is pushed
PUSHFF00        Constant 0xFF00 is pushed
PUSH00FF        Constant 0x00FF is pushed
PUSHWORD+n      nth word of packet is pushed


All binary operations except NOP remove two words
from the top of the stack and push one result word. In
the table that follows, the original top of stack is
abbreviated T1, the word below that is T2, and the result is
R. For logical operations (AND, OR, XOR), a value is
interpreted as TRUE if it is non-zero.


Binary
operation       Result on stack

EQ              R = TRUE if T2 == T1, else FALSE
NEQ             R = TRUE if T2 <> T1, else FALSE
LT              R = TRUE if T2 < T1, else FALSE
LE              R = TRUE if T2 <= T1, else FALSE
GT              R = TRUE if T2 > T1, else FALSE
GE              R = TRUE if T2 >= T1, else FALSE
AND             R = T2 AND T1
OR              R = T2 OR T1
XOR             R = T2 XOR T1
NOP             No effect on stack


The following “short-circuit” binary operations all
evaluate R := (T1 == T2) and push the result R on the
stack. They return immediately under specified
conditions, otherwise the program continues.


Binary          Returns
operation       immediately     if result is

COR             TRUE            TRUE
CAND            FALSE           FALSE
CNOR            FALSE           TRUE
CNAND           TRUE            FALSE


Figure 3-6: Packet filter language summary
Figure 3-7: Format of Pup Packet header
on 3Mb Ethernet (after [2])


This filter accepts all Pup packets with Pup Types 
between 1 and 100. The Pup Type field is a one byte 
field, so it must be masked out of the appropriate word in 
the packet. 

struct enfilter f = {
    10, 12,                         /* priority and length */
    PUSHWORD+1, PUSHLIT | EQ, 2,    /* packet type == PUP */
    PUSHWORD+3, PUSH00FF | AND,     /* mask low byte */
    PUSHZERO | GT,                  /* PupType > 0 */
    PUSHWORD+3, PUSH00FF | AND,     /* mask low byte */
    PUSHLIT | LE, 100,              /* PupType <= 100 */
    AND,                            /* 0 < PupType <= 100 */
    AND                             /* && packet type == PUP */
};


Figure 3-8: Example filter program 


This filter accepts Pup packets with a Pup DstSocket 
field of 35. The DstSocket field occupies two words, so 
the filter must test both words and combine them with an 
AND operation. The DstSocket field is checked before 
the packet type field, since in most packets the DstSocket 
is likely not to match and so the short-circuit operation 
will exit immediately. 

struct enfilter f = {
    10, 8,                          /* priority and length */
    PUSHWORD+8, PUSHLIT | CAND, 35, /* Low word of socket == 35 */
    PUSHWORD+7, PUSHZERO | CAND,    /* High word of socket == 0 */
    PUSHWORD+1, PUSHLIT | EQ, 2     /* packet type == Pup */
};


Figure 3-9: Example filter program 

using short-circuit operations 
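
As noted in section 3.1, filters are normally “compiled” at run time by a library procedure rather than written by hand. The paper does not show that procedure, so the following is only a sketch of what such a helper might look like for the socket test of figure 3-9. The function name, the opcode values, and the bit layout (action in the low 10 bits, operator in the high 6 bits) are all assumptions made for illustration; the word offsets and the Pup type value are taken from the figure.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding (hypothetical values): action in the low 10 bits,
 * binary operator in the high 6 bits of each instruction word. */
enum { PUSHLIT = 1, PUSHZERO = 2, PUSHWORD = 16 };
enum { EQ = 1 << 10, CAND = 11 << 10 };

/* Word offsets and type value as used in figure 3-9. */
enum { PUP_TYPE_WORD = 1, PUP_SOCK_HI_WORD = 7, PUP_SOCK_LO_WORD = 8,
       PACKET_TYPE_PUP = 2 };

/* Emits the instruction words of a filter that accepts Pup packets whose
 * 32-bit DstSocket equals `socket`. Returns the number of words written
 * (8 or 9), or 0 if `out` has room for fewer than 9 words. The priority
 * and length fields of the enclosing structure are left to the caller. */
size_t build_socket_filter(uint32_t socket, uint16_t *out, size_t max)
{
    uint16_t lo = (uint16_t)(socket & 0xFFFF);
    uint16_t hi = (uint16_t)(socket >> 16);
    size_t n = 0;

    if (max < 9)
        return 0;

    /* Test the low word first: it is the most likely to differ. */
    out[n++] = PUSHWORD + PUP_SOCK_LO_WORD;
    out[n++] = PUSHLIT | CAND;
    out[n++] = lo;

    out[n++] = PUSHWORD + PUP_SOCK_HI_WORD;
    if (hi == 0) {
        out[n++] = PUSHZERO | CAND;     /* one word shorter than PUSHLIT */
    } else {
        out[n++] = PUSHLIT | CAND;
        out[n++] = hi;
    }

    /* Finally, confirm that the packet is a Pup packet at all. */
    out[n++] = PUSHWORD + PUP_TYPE_WORD;
    out[n++] = PUSHLIT | EQ;
    out[n++] = PACKET_TYPE_PUP;
    return n;
}
```

Called with socket 35, this sketch produces the same eight instruction words as figure 3-9.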


3.2. Access Control 

Normally, once a packet has been accepted for 
delivery to a process, it will not be submitted to the 
filters of any other processes. Because it is not possible 
to determine if two filters will accept overlapping sets 
of packets, we need a way to prevent one process from 
inappropriately diverting packets meant for another 
process. 

Associated with each filter is a priority; filters are
applied in order of decreasing priority, so if two filters
would both accept a packet, it goes to the one with
higher priority. (Priority has another purpose; if
priorities are assigned proportional to the likelihood
that a filter will accept a packet, then the “average”
packet will match one of the first few filters it is tested
against, consequently reducing the amount of filter
interpretation overhead.) If two filters have the same
priority, the order of application is unspecified (the
interpreter may occasionally reorder such filters to place
the busier ones first); in these cases one must take care
to ensure that the filters accept disjoint sets of packets.

Optionally, a process may specify that the packets
accepted by its filter should be submitted to other,
lower-priority, filters as well; multiple copies of such
packets may be delivered. This is useful in
implementing monitoring facilities without disturbing the
processes being monitored, in “group” communication
where a packet may be multicast to several processes
on one host, or when it is not possible to filter precisely
enough within the kernel.

This access control mechanism does not in itself
protect against malicious or erroneous processes
attempting to divert packets; it only works when
processes play by the rules. In the research
environment for which the packet filter was developed, this has
not been a problem, especially since there are many
other ways to eavesdrop on an Ethernet. An earlier
version of the packet filter did provide some security by
restricting the use of high-priority filters to certain
users, allowing these users first rights to all packets, but
this mechanism went unused.

Because typical networks are easily tapped, most 
proposals for secure communication rely on encryption 
to protect against eavesdropping. If packets are 
encrypted, some header fields must be transmitted in 
cleartext to allow demultiplexing; this is not peculiar to 
use of the packet filter, especially if encryption is on a 
per-process basis. 

3.3. Control and status information 

The user can control the packet filter’s action in a
variety of ways, by specifying: the filter to be
associated with a packet filter port; the timeout duration
for blocking reads (or optionally, immediate return or
indefinite blocking); the signal, if any, to be delivered
upon packet reception; and the maximum length of the
per-port input queue.






Information provided by the packet filter to programs
includes: the type of the underlying data-link layer; the
lengths of a data-link layer address and of a data-link
layer header; the maximum packet size for the
data-link; the data-link address for incoming packets; and the
address used for data-link layer broadcasts, if one
exists.

The user can also ask that each received packet be
marked with a timestamp and a count of the number of
packets lost due to queue overflows in the network
interface and in the kernel.

4. Implementation 

The packet filter is implemented in 4.3BSD Unix as a
“character special device” driver. Just as the Unix
terminal driver is layered above communications device
drivers to provide a uniform abstraction, the packet
filter is layered above network interface device drivers.
As with any character device driver, it is called from
user code via open, close, read, write, and ioctl system
calls. The packet filter is called from the network
interface drivers upon receipt of packets not destined for
kernel-resident protocols.

Most of the complexity in the implementation is
involved in bookkeeping and in managing asynchrony.
When a packet is received, it is checked against each
filter, in order of decreasing priority, until it is accepted
or until all filters have rejected it (see figure 4-1). The
filter interpreter is straightforward, but must be
carefully coded since its inner loop is quite busy. It simply
iterates through the “instruction words” of a filter
(there are no branch instructions), evaluating the filter
predicate using a small stack. When it reaches the end
of the filter, or a short-circuit conditional is satisfied, or
an error is detected, it returns the predicate value to
indicate acceptance or rejection of the packet.


Accepted := false;

for priority := MaxPriority downto MinPriority do
    for i := FirstFilter[priority] to LastFilter[priority] do
        if Apply(Filter[i], rcvd_pkt) = MATCH then
            Deliver(Port[i], rcvd_pkt);
            Accepted := true;
        end;
    end;
end;

if not Accepted then
    Drop(rcvd_pkt);
end;


Figure 4-1: Pseudo-code for filter application loop 
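
The interaction between this loop and the access-control rules of section 3.2 (exclusive delivery by default, optional submission to lower-priority filters) can be sketched in C. This is a simplified model, not the driver's code: ordinary C functions stand in for interpreted filters, ports are assumed to be kept sorted by decreasing priority, and all names, including the pass_on flag and the two sample predicates, are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct port {
    int priority;        /* higher values are tried first */
    int pass_on;         /* nonzero: accepted packets are also offered
                            to the remaining, lower-priority filters */
    int (*accepts)(const uint16_t *pkt, size_t npkt);
                         /* stand-in for the interpreted filter */
    unsigned delivered;  /* packets queued to this port so far */
};

/* Sample predicates (hypothetical): word 1 holds the packet type. */
int is_pup(const uint16_t *pkt, size_t npkt)
{
    return npkt > 1 && pkt[1] == 2;
}

int any_packet(const uint16_t *pkt, size_t npkt)
{
    (void)pkt;
    (void)npkt;
    return 1;
}

/* ports[] must already be sorted by decreasing priority.
 * Returns the number of ports the packet was delivered to;
 * 0 means that no filter accepted it and the packet is dropped. */
unsigned dispatch(struct port *ports, size_t nports,
                  const uint16_t *pkt, size_t npkt)
{
    unsigned copies = 0;

    for (size_t i = 0; i < nports; i++) {
        if (!ports[i].accepts(pkt, npkt))
            continue;
        ports[i].delivered++;
        copies++;
        if (!ports[i].pass_on)
            break;       /* exclusive delivery: stop at the first match */
    }
    return copies;
}
```

A monitoring process would open a high-priority port with pass_on set, so that it sees each packet without preventing delivery to the process the packet was meant for.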


The packet filter module is about 2000 lines of
heavily-commented C source code (under 6K bytes of
Vax machine code); each of the network interface
device drivers must be modified with a few dozen lines
of linkage code. Aside from this, the packet filter
requires no modification of the Unix kernel. Because it is
well-isolated from the rest of the kernel, it is easily
ported to different Unix implementations. Ports have
been made to the Sun Microsystems Inc. operating
system, which is internally quite similar to 4.2BSD, and to
the Ridge Operating System (ROS) of Ridge
Computers, Inc. ROS is a message-based operating system
with inexpensive processes [1]; its internal structure is
distinctly different from that of Unix. The packet filter
has also been ported to Pyramid Technology’s Unix
system, with minor modification for use in a
multiprocessor. It appears to be relatively easy to port the
packet filter to a variety of operating systems; this in
turn makes it possible to port user-level networking
code without further kernel modifications.

5. Uses of the packet filter 

The packet filter is successful because it provides a 
useful facility with adequate performance. Section 6 
provides quantitative measures of performance; in this 
section we consider qualitative utility. 

The primary goal of the packet filter is to simplify the
development and improvement of networking software
and protocols. Since networking software is often in a
continual state of development, anything that speeds
debugging and modification reduces the mismatch
between the software and the networking environment.
This is especially important for the experimental
development of new protocols. Similarly, since
operating systems are continually changing, decoupling
network code from the rest of the system reduces the risk
of “software rot.”

The remainder of this section describes examples
demonstrating how the packet filter has been of
practical use.

5.1. Pup protocols 

The Pup [2] protocol suite includes a variety of
applications using both datagram (request-response) and
stream transport protocols. At Stanford, almost all of
the Pup protocols were implemented for Unix, based
entirely on the packet filter. Although Pup, as an
experimental architecture, has some notable flaws, for
about five years this implementation served as the
primary link between Stanford’s Unix systems and
other campus hosts and workstations. Pup is still in
relatively heavy use in a number of organizations, most
of which have used the Stanford implementation.

The experience with Pup has shown the value of
decoupling the networking implementation from the
Unix kernel. Not only did this make it possible to
develop the Pup code without the effort of kernel
debugging, it also made it possible to modify the kernel
without having to worry about the integrity of the Pup
code. When, every few years, a new release of the
Berkeley Unix kernel became available, it sufficed to
re-install the kernel module implementing the packet
filter. The Pup code could then be run, often without
recompilation, under the new operating system. The
initial port of the packet filter code from 4.1BSD to
4.2BSD took several evenings; for comparison, it took
six programmer-months to port BBN’s TCP
implementation from 4.1BSD to 4.2BSD [14]. That the BBN
TCP code is kernel-resident undoubtedly contributed to
the time it took to port.

5.2. V-system protocols 

The V-system is a message-based distributed
operating system. As an ongoing research project, it is under
continual development and revision. The architects of
the V-system have chosen to design their own
protocols, to obtain high performance and so that they
could make use of the multicast feature of Ethernet
hardware [6].

Although the V-system is primarily a collection of 
workstations and servers running the V kernel, Unix 
hosts were integrated into the distributed system to 
provide disk storage, compute cycles, mail service, and 
other amenities not available in a new operating system. 
The Unix hosts had to be taught to speak the V-system 
Inter-Kernel Protocol (IKP). Fortunately, the packet 
filter was available for use as the basis of a user-level V 
IKP server process. 

The V IKP is a simple protocol and could have been
put in the Unix kernel. This, however, would have
required the V researchers to learn about the details of
the Unix kernel, to participate in the maintenance of the
kernel, and to re-install the IKP implementation in each
new release of the operating system. Instead, they were
able to devote their attention to research on the topics
that interested them. One result of this research was the
VMTP protocol [5], a replacement for the V IKP.
Although there is a kernel-resident implementation of
VMTP for 4.3BSD, the first implementation used the
packet filter. The user-level implementation allowed
rapid development of the protocol specification through
experimentation with easily-modified code. (Section
6.3 contrasts the performance of the
two VMTP implementations.)

5.3. RARP 

The Reverse Address Resolution Protocol
(RARP) [12] was designed to allow workstations to
determine their Internet Protocol (IP) addresses without
relying on any local stable storage. One issue in the
definition of this protocol was whether it should be a
layer above IP, or a parallel layer. The former leads to
a chicken-or-egg dilemma; the latter is cleaner but
raised the question of implementability under 4.2BSD.
With the packet filter, however, a RARP
implementation was easy; the work was done in a few weeks by a
student who had no experience with network
programming, and who had no need to learn how to modify the
Unix kernel.

5.4. Network Monitoring 

For the developer or maintainer of network software,
no tool is as valuable as a network monitor. A network
monitor captures and displays traces of the packets
flowing among hosts; a packet trace makes it much
easier to understand why two hosts are unable to
communicate, or why performance is not up to par.

Most commercially-available network monitors
(including the Excelan LANalyzer [11], the Network
General Sniffer [18], and the Communications
Machinery Corp. LanScan [7]) are stand-alone units
dedicated to monitoring specific protocols. A network
monitor closely integrated with a general-purpose
operating system, running on a workstation, has several
important advantages over a dedicated monitor:

• All the tools of the workstation are available for 
manipulating and analyzing packet traces. 

• A user can write new monitoring programs to
display data in novel ways, or to monitor new or
unusual protocols.

One of us has been using the packet filter, on a 
MicroVAX-II workstation, as the basis for a variety of 
experimental network monitoring tools. This system 
has sufficient performance to record all packets flowing 
on a moderately busy Ethernet (with rare lapses), and 
more than sufficient performance to capture all packets 
between a pair of communicating hosts. Since one can 
easily write arbitrarily elaborate programs to analyze 
the trace data, and even to do substantial analysis in real 
time, an integrated network monitor appears to be far 
more useful than a dedicated one. (Sun Microsystems’
etherfind program is another example of an integrated
network monitor. It is based on Sun’s Network Interface
Tap (NIT) facility, which is similar to the packet
filter but only allows filtering on a single packet
field¹ [22].)

6. Performance 

We measured the performance of the packet filter in 
several ways. We determined the amount of processor 
time spent on packet filter routines, and we measured 
the throughput of protocol implementations based on 
the packet filter. We compared these measurements 
with those for kernel-resident implementations of 
similar protocols, and found that in practice packet- 
filter-based protocol implementations perform fairly 
well. 

All measurements were made using VAX processors
running 4.2BSD or 4.3BSD Unix, using either a
10Mbit/sec or 3Mbit/sec Ethernet. Note that the packet
filter coexists with kernel-resident protocol implementations,
without affecting their performance.


¹ Sun expects to include our packet-filtering mechanism in a future
release of NIT.


6.1. Kernel per-packet processing time 

One indication of the packet filter’s cost is the kernel 
CPU time required to process an “average” received 
packet. We measured this time for the packet filter, and 
for analogous functions of kernel-resident protocols. A 
4.3BSD Unix kernel was configured to collect the CPU 
time spent in and number of calls made to each kernel 
subroutine. The profiled kernel was run for 28 hours on 
a timesharing VAX-11/780, and gprof [13] was used to
format the data. 

During the profiling period, the system handled 1.3
million packets. The packet filter processed 21% of
these packets; the remainder were IP packets (69% of
the total) and ARP packets (10%). All per-packet
processing times reported are for “average” packets
and “typical” filter predicates.

Processing times for transmitted packets are about 
the same for either the packet filter or the kernel- 
resident IP implementation; it takes about 1 mSec to 
send a datagram. The packet filter has a slight edge, 
since it does not need to choose a route for the 
datagram or compute a checksum. 

Packet filter: The packet filter spends an average of 
1.57 mSec processing each packet. 41% of this 
time is spent evaluating filter predicates; the 
average packet is tested against 6.3 predicates. We 
derived a crude estimate for the time to process a 
packet: 0.8 mSec + (0.122 * number of predicates 
tested) mSec. The average number of predicates 
tested will normally be somewhat less than half the 
number of active ports, because the priority 
mechanism described in section 3.2 can cause the 
most likely filters to be tested first. 

Kernel-resident IP implementation: The average
time required to process a received IP packet was 
1.77 mSec. This includes all protocol processing 
up to the TCP and UDP layers; if only the IP layer 
processing is counted, the average packet requires 
about 0.49 mSec. This means that the kernel- 
resident IP layer is about three times faster than the 
packet filter at processing an average packet. 
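
As an illustration only (it is not part of the measured system), the crude estimate above can be evaluated with a few lines of Python. The constants are those quoted in this section; the function and constant names are our own.

```python
# Constants quoted in section 6.1 (VAX-11/780, 4.3BSD); names are illustrative.
FIXED_COST_MSEC = 0.8        # per-packet cost that does not depend on the filters
PER_PREDICATE_MSEC = 0.122   # cost of testing one filter predicate
IP_LAYER_MSEC = 0.49         # kernel-resident IP layer alone


def packet_filter_msec(predicates_tested):
    """Crude estimate of kernel CPU time to process one received packet."""
    return FIXED_COST_MSEC + PER_PREDICATE_MSEC * predicates_tested


# Evaluate the estimate at the measured average of 6.3 predicates per packet.
average = packet_filter_msec(6.3)
print(f"estimated time per packet: {average:.2f} mSec")
print(f"ratio to IP-layer processing: {average / IP_LAYER_MSEC:.1f}")
```

At 6.3 predicates the estimate reproduces the measured 1.57 mSec average, and the ratio to the IP-layer figure is a little over three.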

6.2. Total per-packet processing time 

The kernel profile does not account for the entire cost 
of handling packets. We measured actual packet rates 
into and out of user processes on a microVax-II running 
Ultrix 1.2, using a synthetic load. The results for 
packet reception are included in tables 6-8 and 6-9 in 
section 6.5. 

Although sending datagrams via the packet filter 
costs less than sending an unchecksummed UDP 


datagram of the same size (see table 6-1), we estimate 
that this is still about twice the cost for the kernel to 
send a datagram on its own. For packets that carry no 
useful data (acknowledgements, for example) user-level 
protocol implementations pay this additional penalty. 


Total           Elapsed time per packet sent
packet          via               via
size            packet filter     UDP

128 bytes       1.9 mSec          3.1 mSec
1500 bytes      3.6 mSec          4.9 mSec


Table 6-1: Cost of sending packets


6.3. VMTP performance 

The only interesting protocol for which there is both 
a packet-filter based implementation and a kernel- 
resident implementation is VMTP [5]. This provides a 
basis for a direct measurement of the cost of user-level 
implementation; while there are minor differences in
the actual protocols implemented, and the two implementations
are not of precisely equal quality, they
follow essentially the same pattern of packet transport.
All these measurements, unless noted, were carried out
using microVax-II processors, 4.3BSD Unix, and a
10Mbit/sec Ethernet. In each case, both ends of the
transfer used identical protocol implementations.

We measured the cost for a minimal round-trip 
operation (reading zero bytes from a file). The results, 
shown in table 6-2, indicate that the penalty for user- 
level implementation is almost exactly a factor of two. 
On this measurement, the Unix kernel implementation 
of VMTP is quite close to the V kernel implementation, 
indicating that there is no obvious inefficiency in the 
Unix kernel implementation. 


VMTP implementation     Elapsed time/operation

Packet filter           14.7 mSec
Unix kernel             7.44 mSec
V kernel                7.32 mSec


Table 6-2: Relative performance of VMTP
for small messages


We also measured the cost for transferring bulk data 
using VMTP. This was done by repeatedly reading the 
same segment of a file, which therefore stayed in the 
file system buffer cache; consequently, the measured 
rates should be nearly independent of disk I/O speed. 
(In each trial about 1 Mb was transferred.) We also
measured TCP performance, for comparison; note that
TCP checksums all data, whereas these implementations
of VMTP do not. The results, shown in table
6-3, show that in this case the penalty for user-level
implementation of VMTP is almost exactly a factor of
three.


Implementation          Rate

Packet filter           112 Kbytes/sec
Unix kernel VMTP        336 Kbytes/sec
V kernel VMTP           278 Kbytes/sec
Unix kernel TCP         222 Kbytes/sec


Table 6-3: Relative performance of VMTP
for bulk data transfer


The packet-filter based implementation measured in 
table 6-3 uses received-packet batching. Table 6-4 
shows that batching improves throughput by about 75% 
over identical code that reads just one packet per system
call; the difference cannot be entirely due to
decreased system call overhead, but may reflect reductions
in context switching and dropped packets.


Batching        Rate

Yes             112 Kbytes/sec
No              64 Kbytes/sec


Table 6-4: Effect of received-packet batching
on performance

We also tried to measure the cost of a user-level 
demultiplexing process, by simulating it within the 
client VMTP implementation. This is done by using an
extra process to receive packets, which are then passed 
to the actual VMTP process via a Unix pipe. (In this 
case, the server process was not modified.) Table 6-5 
shows that user-level demultiplexing has a small cost 
(20% greater latency) for short messages, but decreases 
bulk throughput by more than a factor of four (much of 
this is attributable to the poor IPC facilities in 4.3BSD). 



Demultiplexing      Elapsed time per
done in             minimal operation (mSec)     Bulk rate

Kernel              14.72                        112 Kbytes/sec
User process        18.08                        25 Kbytes/sec


Table 6-5: Effect of user-level demultiplexing
on performance


6.4. Byte-stream throughput 

We compared the performance of a Pup/BSP (Byte- 
Stream Protocol) implementation using the packet filter 
with that of the IP/TCP [20] implementation in the 
4.3BSD kernel. These measurements were carried out 
using microVax-II processors, 4.3BSD Unix, and a 
10Mbit/sec Ethernet.


Table 6-6 shows the rates at which the two implementations
can transfer bulk data from process to
process. TCP is faster by almost a factor of six. When
used to implement a File Transfer Protocol (FTP), TCP
slows by a factor of two if the source of data is a disk
file, but the BSP throughput remains unchanged, indicating
that network performance is the rate-limiting
factor for BSP file transfer.


Implementation          Rate

Packet filter BSP       38 Kbytes/sec
Unix kernel TCP         222 Kbytes/sec


Table 6-6: Relative performance of
stream protocol implementations


Pup (hence BSP) allows a maximum packet size of
568 bytes, whereas TCP in 4.3BSD uses 1078-byte
packets and so sends only half as many; we found that
if TCP is forced to use the smaller packet size, its performance
is cut in half. After this correction, TCP
throughput is still three times that of BSP; most of the
difference is attributable to the cost of BSP’s user-level
implementation. This is consistent with the factor-of-three
difference we measured for VMTP bulk data transfer (table 6-3).
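
The packet-size correction is simple arithmetic on the rates in table 6-6. The Python fragment below is only a sketch of that calculation; the names are ours, and the factor of two for small packets is the one reported in the text.

```python
# Rates from table 6-6, in Kbytes/sec; names are illustrative.
TCP_RATE = 222.0
BSP_RATE = 38.0


def corrected_ratio(tcp_rate, bsp_rate, small_packet_penalty=2.0):
    """TCP/BSP throughput ratio after slowing TCP to BSP-sized packets."""
    return (tcp_rate / small_packet_penalty) / bsp_rate


print(f"uncorrected ratio: {TCP_RATE / BSP_RATE:.1f}")
print(f"corrected ratio:   {corrected_ratio(TCP_RATE, BSP_RATE):.1f}")
```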

We also measured performance for Telnet (remote
terminal access)². A program on the “server” host
(VAX-11/780) prints characters which are transmitted
across the network and displayed at the “user” host.
The results are shown in table 6-7. The “Output rate” 
column shows the overall throughput, in characters per 
second, for each configuration. 


Telnet          Network           Output rate
protocol        bandwidth         (characters/sec)

Pup/BSP         10 Mbit/sec       1635
IP/TCP          10 Mbit/sec       1757
Pup/BSP         3 Mbit/sec        878
IP/TCP          3 Mbit/sec        933


Table 6-7: Relative performance of Telnet

The first two rows of the table show throughput using
an MC68010-based workstation capable of displaying
about 3350 characters per second. The achieved
throughput is about half that, varying only slightly according
to whether TCP or BSP (and thus the packet
filter) is used. The last two rows, measured with
characters displayed on a 9600 baud terminal, show almost
no difference between BSP and TCP performance.
These output rates are clearly limited by the display
terminal, not by network performance.


² This test was done under 4.2BSD.







In summary, a kernel-resident implementation of a 
stream protocol such as VMTP or BSP appears to be 
about two or three times as fast as an implementation 
based on the packet filter. In many applications, the 
actual performance difference may be much smaller; 
the packet-filter implementation of VMTP is only 40% 
slower than the kernel-resident TCP when used for file 
transfer. The VMTP and BSP implementations are 
quite useful in practice; disks and terminals are more 
often serious bottlenecks than the packet filter. 

6.5. Costs of demultiplexing outside the kernel 

We asserted in section 2 that using a user-level 
process to demultiplex received packets to other 
processes would result in poor performance. In section 
6.3 we showed that this appears to be true, especially 
for bulk-data transfer. In this section, we analyze the 
additional cost using measurements of Ultrix 1.2; the 
measurements are inspired by those made by 
McKusick, Karels, and Leffler [16]. 

6.5.1. Analytical model 

If a demultiplexing process is used, each received 
packet results in at least two context switches: one into 
the demultiplexing process and one into the receiving 
process³. If the system has other active processes, an
additional context switch to an unrelated process may 
occur, when the receiving process blocks waiting for 
the next packet. 

With direct delivery of received packets, in the best 
case the receiving process will never be suspended, and 
no context switches take place. In the worst case, with 
other active processes, a received packet will cause two 
context switches. 

Either mechanism requires at least one data transfer
between kernel and process. Since Unix does not support
memory sharing, the demultiplexing process requires
two additional data transfers to get the packet
into the final receiving process.

6.5.2. Cost of overhead operations 

Benchmarks indicate that a MicroVAX-II running 
Ultrix 1.2 requires about 0.4 mSec of CPU time to 
switch between processes, and about 0.5 mSec of CPU 
time to transfer a short packet between the kernel and a 
process. Therefore, we predict that receiving a short
packet using a demultiplexing process should take at
least 2.3 mSec, while for the packet filter these overhead
costs may be as low as 0.5 mSec per packet; the
difference increases for longer packets because data
copying requires about 1 mSec/Kbyte.
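
The prediction follows directly from the counts in section 6.5.1 and the two benchmark figures above. The Python sketch below spells out that arithmetic; it is illustrative only, and the names are our own.

```python
# Benchmark figures quoted above (MicroVAX-II, Ultrix 1.2); names are illustrative.
CONTEXT_SWITCH_MSEC = 0.4   # switch between processes
SHORT_COPY_MSEC = 0.5       # move a short packet between kernel and process


def overhead_msec(context_switches, data_transfers):
    """Overhead for one short packet, counting only switches and copies."""
    return (context_switches * CONTEXT_SWITCH_MSEC
            + data_transfers * SHORT_COPY_MSEC)


# Demultiplexing process: two context switches, and three transfers
# (kernel to demultiplexer, then two more to reach the receiver).
demux_process = overhead_msec(context_switches=2, data_transfers=3)

# Packet filter, best case: the receiver is never suspended, one transfer.
filter_best = overhead_msec(context_switches=0, data_transfers=1)

# Packet filter, worst case from section 6.5.1: two context switches.
filter_worst = overhead_msec(context_switches=2, data_transfers=1)

print(f"demultiplexing process: {demux_process:.1f} mSec")
print(f"packet filter, best:    {filter_best:.1f} mSec")
print(f"packet filter, worst:   {filter_worst:.1f} mSec")
```

The worst-case packet filter figure is not stated in the text; it simply applies the same two constants to the worst case described in section 6.5.1.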


³ We assume that no batching of packets takes place; this assumption
breaks down when packets arrive faster than the system can
switch contexts.


6.5.3. Measured costs 

These costs are not the only ones associated with 
receiving a packet; they are the ones that are affected by 
the use of user-level demultiplexing. We measured the 
actual elapsed time required to receive packets of 
various sizes; the “demultiplexing process” receives 
packets from the network and passes them to a second 
process via a Unix pipe. The results are shown in table 
6-8. The additional cost of user-level demultiplexing
agrees fairly closely with our prediction.



                Elapsed time if demultiplexing
Packet          done in         done in
size            kernel          user process

128 bytes       2.3 mSec        5.0 mSec
1500 bytes      4.0 mSec        9.0 mSec


Table 6-8: Per-packet cost of
user-level demultiplexing


Since received-packet batching, as we saw in section
6.3, can amortize the costs of context-switching over
many packets, we repeated our measurements with
batching enabled; the batch size was hard to control but
the results are about the same for four or more packets
per batch. The results are shown in table 6-9; batching
clearly reduces the penalty associated with user-level
demultiplexing, but the difference remains significant.



                Elapsed time if demultiplexing
Packet          done in         done in
size            kernel          user process

128 bytes       1.9 mSec        2.4 mSec
1500 bytes      3.5 mSec        5.9 mSec


Table 6-9: Per-packet cost of user-level
demultiplexing with
received-packet batching

The measurements in tables 6-8 and 6-9 were made 
without any real decision-making on the part of the 
demultiplexer. Before we condemn user-level demultiplexing
on the basis of its high overhead, we must
show that the cost of interpreting packet filters in the 
kernel does not dwarf the benefit of avoiding context 
switches (presumably, a user-level demultiplexer could 
make decisions at least as efficiently and possibly more 
so). We measured the cost of interpreting filter 
programs of various lengths; the results are shown in 
table 6-10. (Batching was enabled and all packets were 
128 bytes long.) It usually takes two or three filter 
instructions to test one packet field; even with rather 
long filters (21 instructions) the additional cost for filter 
interpretation is less than the cost of user-level demultiplexing
if no more than three such long filters are
applied to an incoming packet before one filter accepts 
it. 





Filter length       Elapsed time
(instructions)      per packet

0                   1.9 mSec
1                   2.0 mSec
9                   2.2 mSec
21                  2.5 mSec


Table 6-10: Cost of interpreting packet filters
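
One way to summarize table 6-10 is as a marginal cost per filter instruction. The Python sketch below fits a least-squares line to the four measurements; the fit is our own illustration, not an analysis from the measurements themselves, and with only four points it should be read as a rough guide.

```python
# (filter length in instructions, elapsed mSec per packet) from table 6-10.
MEASUREMENTS = [(0, 1.9), (1, 2.0), (9, 2.2), (21, 2.5)]


def least_squares_line(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares_line(MEASUREMENTS)
print(f"marginal cost: {slope * 1000:.0f} uSec per filter instruction")
print(f"fixed cost:    {intercept:.2f} mSec per packet")
```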


For filters using short-circuit conditionals, the break-even
point is closer to an average of about ten filters
before acceptance, which should occur when more than 
twenty filters are active. This means that even if one 
assumes zero cost for decision-making in a user-level 
demultiplexer, the break-even point comes with twenty 
different processes using the network. For packets 
longer than 128 bytes, the break-even point comes with 
even more active processes. 

In summary, kernel demultiplexing performs significantly
better than user-level demultiplexing for a
wide range of situations. This advantage disappears 
only if a very large number of processes are receiving 
packets. 

7. Problems and possible improvements 

Since its beginnings in early 1980, the packet filter 
has often been revised to support additional applications
or provide better performance. There is still room
for improvement. 

The filter language described in section 3 only allows 
the user to specify packet fields at constant offsets from 
the beginning of a packet. This has been adequate for 
protocols with fixed-format headers (such as Pup), but 
many network protocols allow variable-format headers. 
For example, since the IP header may include optional 
fields, fields in higher layer protocol headers are not at 
constant offsets. The current packet filter can be made 
to handle non-constant offsets only with considerable 
awkwardness and inefficiency; the filter language needs 
to be extended to include an “indirect push” operator, 
as well as arithmetic operators to assist in addressing- 
unit conversions. 
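
To make the problem concrete, the Python sketch below evaluates a small stack-based filter over the 16-bit words of a packet. Everything in it is an assumption made for illustration: the operator names (including PUSH_IND for the proposed "indirect push", plus ADD and SHR), their semantics, and the test packets are hypothetical and are not the filter language of section 3. It assumes a 14-byte Ethernet header followed by an IP header and a UDP header.

```python
import struct

ETHERNET_WORDS = 7  # 14-byte Ethernet header, counted in 16-bit words


def run_filter(program, packet):
    """Evaluate a stack-based filter over the packet's 16-bit words."""
    count = len(packet) // 2
    words = struct.unpack(f">{count}H", packet[: count * 2])
    stack = []
    for op, arg in program:
        if op == "PUSH_LIT":
            stack.append(arg & 0xFFFF)
        elif op == "PUSH_WORD":        # field at a constant word offset
            if arg >= count:
                return False           # field lies outside the packet
            stack.append(words[arg])
        elif op == "PUSH_IND":         # hypothetical: offset comes from the stack
            index = stack.pop()
            if index >= count:
                return False
            stack.append(words[index])
        else:
            right, left = stack.pop(), stack.pop()
            if op == "AND":
                stack.append(left & right)
            elif op == "ADD":
                stack.append((left + right) & 0xFFFF)
            elif op == "SHR":
                stack.append(left >> right)
            elif op == "EQ":
                stack.append(int(left == right))
            else:
                raise ValueError(f"unknown operator {op}")
    return bool(stack) and stack[-1] != 0


def udp_port_constant(port):
    """Assumes a 20-byte IP header, so the UDP destination port is word 18."""
    return [("PUSH_WORD", ETHERNET_WORDS + 10 + 1), ("PUSH_LIT", port), ("EQ", None)]


def udp_port_indirect(port):
    """Computes the port's offset from the IP header-length field."""
    return [
        ("PUSH_WORD", ETHERNET_WORDS),                    # version, header length, TOS
        ("PUSH_LIT", 0x0F00), ("AND", None),              # isolate header length
        ("PUSH_LIT", 7), ("SHR", None),                   # 32-bit units to 16-bit words
        ("PUSH_LIT", ETHERNET_WORDS + 1), ("ADD", None),  # skip Ethernet header, source port
        ("PUSH_IND", None),
        ("PUSH_LIT", port), ("EQ", None),
    ]


def make_packet(ihl, dst_port):
    """Build a toy Ethernet/IP/UDP packet; ihl is the IP header length in 32-bit units."""
    ethernet = bytes(12) + struct.pack(">H", 0x0800)
    ip = bytes([0x40 | ihl]) + bytes(ihl * 4 - 1)
    udp = struct.pack(">HHHH", 1234, dst_port, 8, 0)
    return ethernet + ip + udp


plain = make_packet(5, 53)          # no IP options
with_options = make_packet(6, 53)   # one 32-bit word of IP options
print(run_filter(udp_port_constant(53), plain), run_filter(udp_port_constant(53), with_options))
print(run_filter(udp_port_indirect(53), plain), run_filter(udp_port_indirect(53), with_options))
```

In this sketch the constant-offset filter matches only the packet without IP options, while the indirect version matches both. The shift by 7 is the addressing-unit conversion: the header length is stored in 32-bit units, but packet fields are addressed in 16-bit words.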

The current filter mechanism deals with 16-bit 
values, requiring multiple filter instructions to load 
packet fields that are wider or narrower. It is possible 
that direct support for other field sizes would improve 
filter-evaluation efficiency. The existing read-batching 
mechanism clearly improves performance for bulk data 
transfer; a write-batching option (to send several packets
in one system call) might also improve performance.

In addition to these problems, which may be regarded 
as deficiencies in the abstract interface, there is room 
for improvement in the existing implementation. 
During evaluation of each filter instruction, the interpreter
verifies that the instruction is valid, that it doesn’t
overflow or underflow the evaluation stack, and that it
doesn’t refer to a field outside the current packet. Since
the filter language does not include branching instructions,
all these tests can be performed ahead of time
(except for indirect-push instructions); this might significantly
speed filter evaluation. Even more speed
could be gained by compiling filters into machine code,
at the cost of greatly increased implementation complexity.
Finally, with a redesigned filter language it
might be possible to compile the set of active filters into
a decision table, which should provide the best possible
performance.
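
The idea of checking a filter once, when it is installed, can be sketched as follows. This Python fragment is a minimal illustration under stated assumptions: the operator names, the stack limit, and the function names are hypothetical, and only a small subset of operators is modeled. Because there are no branches, one linear pass finds the stack depth at every instruction and the highest packet word the filter can touch, leaving a single length comparison per packet.

```python
import struct

STACK_LIMIT = 32  # assumed size of the evaluation stack

# operator -> (values popped, values pushed); a hypothetical subset
STACK_EFFECT = {
    "PUSH_LIT": (0, 1), "PUSH_WORD": (0, 1),
    "AND": (2, 1), "OR": (2, 1), "EQ": (2, 1),
}


def validate(program):
    """Check a filter once; return the number of packet words it needs."""
    depth = 0
    words_needed = 0
    for op, arg in program:
        if op not in STACK_EFFECT:
            raise ValueError(f"invalid instruction {op}")
        pops, pushes = STACK_EFFECT[op]
        if depth < pops:
            raise ValueError("evaluation stack underflow")
        depth += pushes - pops
        if depth > STACK_LIMIT:
            raise ValueError("evaluation stack overflow")
        if op == "PUSH_WORD":
            words_needed = max(words_needed, arg + 1)
    return words_needed


def evaluate_unchecked(program, packet, words_needed):
    """Run a validated filter with one bounds check per packet."""
    count = len(packet) // 2
    if count < words_needed:
        return False                     # some field lies outside the packet
    words = struct.unpack(f">{count}H", packet[: count * 2])
    stack = []
    for op, arg in program:
        if op == "PUSH_LIT":
            stack.append(arg)
        elif op == "PUSH_WORD":
            stack.append(words[arg])
        else:
            right, left = stack.pop(), stack.pop()
            if op == "AND":
                stack.append(left & right)
            elif op == "OR":
                stack.append(left | right)
            else:                        # "EQ"
                stack.append(int(left == right))
    return bool(stack) and stack[-1] != 0


# Example: accept packets whose Ethernet type field (word 6) is 0x0800.
is_ip = [("PUSH_WORD", 6), ("PUSH_LIT", 0x0800), ("EQ", None)]
needed = validate(is_ip)
frame = bytes(12) + struct.pack(">H", 0x0800) + bytes(20)
print(needed, evaluate_unchecked(is_ip, frame, needed))
```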

Idiosyncrasies of the 4.3BSD kernel create other inefficiencies.
For example, because 4.3BSD network
interface drivers strip the data-link layer header from
incoming packets, the packet filter may be spending a
significant amount of time to restore these headers.
Also, in order to mark each packet with a unique timestamp,
the packet filter calls a kernel subroutine called
microtime; on a VAX-11/780, this costs about 70 µSec,
probably more than the timestamp is worth.

8. Summary 

The performance of the packet filter is clearly better 
than that of a user-level demultiplexer, and the performance
of protocol code based on the packet filter is
clearly worse than that of kernel-resident protocol code. 
Since the packet filter is just as flexible as a user-level 
demultiplexer, we believe that in systems where 
context-switching has a substantial cost, it is the right 
basis for implementing network code outside the kernel. 

Are the advantages of user-level network code, even 
with the packet filter, worth the extra cost? Our experience
has convinced us that in many cases, it is. The
performance of such code is quite acceptable, and it 
greatly eases the task of developing protocols and their 
implementations. The packet filter appears to put just 
enough mechanism in the kernel to provide decent performance,
while retaining the flexibility of a user-level
demultiplexer. 